18 research outputs found

EIE: Efficient Inference Engine on Compressed Deep Neural Network

Author: Dally William J.
Han Song
Horowitz Mark A.
Liu Xingyu
Mao Huizi
Pedram Ardavan
Pu Jing
Publication date: 03/05/2016

State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS/s working directly on a compressed network, corresponding to 3 TOPS/s on an uncompressed network, and processes FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy efficiency and area efficiency.

Comment: External Links: TheNextPlatform: http://goo.gl/f7qX0L ; O'Reilly: https://goo.gl/Id1HNT ; Hacker News: https://goo.gl/KM72SV ; Embedded-vision: http://goo.gl/joQNg8 ; Talk at NVIDIA GTC'16: http://goo.gl/6wJYvn ; Talk at Embedded Vision Summit: https://goo.gl/7abFNe ; Talk at Stanford University: https://goo.gl/6lwuer. Published as a conference paper in ISCA 201
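
The abstract above names three ideas: pruned (sparse) weights, weights shared through a small table, and skipping zero activations. The following is a minimal software sketch of how they combine in a matrix-vector product. It is not the EIE hardware design or its storage format; the column-wise layout, the function names `compress` and `sparse_matvec`, and the nearest-value codebook assignment are all assumptions made for illustration.

```python
# Illustrative sketch only: NOT the EIE design or its actual data format.
from typing import Dict, List, Tuple

# Assumed layout: for each input column j, a list of (row, codebook_index)
# pairs for the weights that survived pruning.
SparseColumns = Dict[int, List[Tuple[int, int]]]


def compress(dense: List[List[float]], codebook: List[float]) -> SparseColumns:
    """Drop zero weights and replace each remaining weight by the index of
    its nearest codebook entry (a stand-in for weight sharing)."""
    columns: SparseColumns = {}
    for i, row in enumerate(dense):
        for j, w in enumerate(row):
            if w == 0.0:
                continue  # pruned connection: nothing is stored
            k = min(range(len(codebook)), key=lambda c: abs(codebook[c] - w))
            columns.setdefault(j, []).append((i, k))
    return columns


def sparse_matvec(columns: SparseColumns, codebook: List[float],
                  activations: List[float], n_rows: int) -> List[float]:
    """Compute W @ a using only stored weights and non-zero activations."""
    out = [0.0] * n_rows
    for j, a in enumerate(activations):
        if a == 0.0:
            continue  # zero activation (e.g. after ReLU): skip the column
        for i, k in columns.get(j, []):
            out[i] += codebook[k] * a  # look up the shared weight value
    return out


# Hypothetical 2x3 weight matrix with a two-entry codebook.
dense = [[0.0, 2.0, 0.0],
         [1.0, 0.0, 2.0]]
codebook = [1.0, 2.0]
cols = compress(dense, codebook)
result = sparse_matvec(cols, codebook, [3.0, 0.0, 1.0], n_rows=2)
print(result)  # [0.0, 5.0]
```

In this toy case only three of the six weights are stored, and the column for the zero activation is never read, which is the kind of work the abstract describes avoiding.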

arXiv.org e-Print Archive

Crossref

Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures

Author: Andreas Gerstlauer
Ardavan Pedram
Robert A. van de Geijn
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Crossref

Algorithm/architecture codesign of low power and high performance linear algebra compute fabrics

Author: Pedram Ardavan
Publication date: 27/09/2013

In the past, we could rely on technology scaling and new micro-architectural techniques to improve the performance of processors. Nowadays, both of these methods are reaching their limits. The primary concern in future architectures with billions of transistors on a chip and limited power budgets is power/energy efficiency. Full-custom design of application-specific cores can yield up to two orders of magnitude better power efficiency over conventional general-purpose cores. However, a tremendous design effort is required in integrating a new accelerator for each new application. In this dissertation, we present the design of specialized compute fabrics that maintain the efficiency of full-custom hardware while providing enough flexibility to execute a whole class of coarse-grain operations. The broad vision is to develop integrated and specialized hardware/software solutions that are co-optimized and co-designed across all layers ranging from the basic hardware foundations all the way to the application programming support through standard linear algebra libraries. We try to address these issues specifically in the context of dense linear algebra applications. In the process, we pursue the main questions that architects will face while designing such accelerators. How broad is this class of applications that the accelerator can support? What are the limiting factors that prevent utilization of these accelerators on the chip? What is the maximum achievable performance/efficiency? Answering these questions requires expertise and careful codesign of the algorithms and the architecture to select the best possible components, datapaths, and data movement patterns resulting in a more efficient hardware-software codesign. In some cases, codesign reduces complexities that are imposed on the algorithm side due to the initial limitations in the architectures.
We design a specialized Linear Algebra Processor (LAP) architecture and discuss the details of mapping matrix-matrix multiplication onto it. We further verify the flexibility of our design for computing a broad class of linear algebra kernels. We conclude that this architecture can perform a broad range of matrix-matrix operations as complex as matrix factorizations, and even Fast Fourier Transforms (FFTs), while maintaining its ASIC-level efficiency. We present a power-performance model that compares state-of-the-art CPUs and GPUs with our design. Our power-performance model reveals sources of inefficiencies in CPUs and GPUs. We demonstrate how to overcome such inefficiencies in the process of designing our LAP. As we progress through this dissertation, we introduce modifications of the original matrix-matrix multiplication engine to facilitate the mapping of more complex operations. We observe the resulting performance and efficiencies on the modified engine using our power estimation methodology. When compared to other conventional architectures for linear algebra applications and FFT, our LAP is over an order of magnitude better in terms of power efficiency. Based on our estimations, up to 55 and 25 GFLOPS/W single- and double-precision efficiencies are achievable on a single chip in standard 45nm technology.

Electrical and Computer Engineerin

Crossref

Texas ScholarWorks

Combined Intratympanic and Systemic Steroid Therapy for Poor-Prognosis Sudden Sensorineural Hearing Loss

Author: Ardavan Tajedini
Pedram Borghei
Shima Arastou
Publication venue: Mashhad University of Medical Sciences
Publication date: 01/12/2012

Introduction: The aim of this study was to evaluate the efficacy of combined intratympanic and systemic steroid therapy compared with systemic steroid therapy alone in idiopathic sudden sensorineural hearing loss (ISSNHL) patients with poor prognostic factors. Materials and Methods: Seventy-seven patients with sudden sensorineural hearing loss (SSNHL) who had at least one poor prognostic factor (age greater than 40 years, hearing loss of more than 70 dB, or a delay of more than 2 weeks between the onset of hearing loss and initiation of therapy) were included in this study. Patients were randomized to the intervention group (combined intratympanic and systemic steroid therapy) or the control group (systemic steroid therapy alone). All patients received oral treatment with systemic prednisolone (1 mg/kg/day for 10 days), acyclovir (2 g/day for 10 days, divided into four doses), triamterene H (daily), and omeprazole (daily, during steroid treatment), and were advised to follow a low-salt diet. The intervention group also received intratympanic dexamethasone injections (0.4 ml of 4 mg/ml dexamethasone) two times a week for two consecutive weeks (four injections in total). A significant hearing improvement was defined as at least a 15-dB decrease in pure tone average (PTA). Results: Among all participants, 44 patients (57.14%) showed significant improvement in hearing evaluation. More patients showed hearing improvement in the intervention group than in the control group (27 patients (75%) versus 17 patients (41.4%), respectively; P = 0.001). Conclusion: The combination of intratympanic dexamethasone and systemic prednisolone is more effective than systemic prednisolone alone in the treatment of poor-prognosis SSNHL.

Directory of Open Access Journals